Method and arrangement for determining a sequence of
actions for a system which has states, a transition in
state between two states being performed on the basis
of an action

The invention relates to a method and an 
arrangement for determining a sequence of actions for a 
system which has states, a transition in state between
two states being performed on the basis of an action. 

Such a method and such an arrangement are known 
from [1].

A financial market is described in [1] as an
example of such a system which has states.

The system is described as a Markov decision
problem (MDP). The structure of a system which can be
described as a Markov decision problem is illustrated
in Figure 2.

The system 201 is in a state $x_t$ at an instant
t. The state $x_t$ can be observed by an observer of the
system. On the basis of an action $a_t$ from the set of
actions possible in the state $x_t$, $a_t \in A(x_t)$, the
system makes a transition with a certain probability
into a subsequent state $x_{t+1}$ at a subsequent instant
t+1.

This is illustrated diagrammatically in Figure
2 by a loop. An observer 200 perceives 202 observable
variables concerning the state $x_t$ and takes a decision
via an action 203 with which it acts on the system 201.
The system 201 is usually subject to interference
205.

Furthermore, the observer 200 obtains a gain $r_t$ 204

$$r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}, \tag{1}$$

which is a function of the action $a_t$ 203 and the
original state $x_t$ at the instant t as well as of the
subsequent state $x_{t+1}$ of the system at the subsequent
instant t+1.

The gain $r_t$ can assume a positive or negative
scalar value, depending on whether the decision leads,
with regard to a prescribable criterion, to a positive
or negative system development, in [1] to an increase
in capital stock or to a loss.

In a further time step, the observer 200 of the
system 201 decides on the basis of the observable
variables 202, 204 of the subsequent state $x_{t+1}$ in
favor of a new action $a_{t+1}$, etc.

A sequence of

State:             $x_t \in X$
Action:            $a_t \in A(x_t)$
Subsequent state:  $x_{t+1} \in X$
Gain:              $r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R}$

etc. describes a trajectory of the system which is
evaluated by a performance criterion which accumulates
the individual gains $r_t$ over the instants t. It is
assumed by way of simplification in a Markov decision
problem that the state $x_t$ and the action $a_t$ contain
all the information needed to describe a transition
probability $p(x_{t+1} \mid \cdot)$ of the system from the
state $x_t$ to the subsequent state $x_{t+1}$.

In formal terms, this means that:

$$p(x_{t+1} \mid x_t, \ldots, x_0, a_t, \ldots, a_0) = p(x_{t+1} \mid x_t, a_t). \tag{2}$$

$p(x_{t+1} \mid x_t, a_t)$ denotes a transition probability
for the subsequent state $x_{t+1}$ for a given state $x_t$
and given action $a_t$.

In a Markov decision problem, future states of
the system 201 are thus not a function of states and
actions which lie further in the past than one time
step.

The characteristics of a Markov decision
problem are represented below by way of summary:

X                           set of possible states of the system,
                            e.g. $X = \mathbb{R}^m$,
$A(x_t)$                    set of possible actions in the state $x_t$,
$p(x_{t+1} \mid x_t, a_t)$  transition probability,
$r(x_t, a_t, x_{t+1})$      gain with expectation $R(x_t, a_t)$.
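
The summary above can be made concrete with a small sketch. The two states, the action names, the probabilities and the gains below are purely illustrative assumptions and do not come from [1]; the sketch only shows how $A(x_t)$, $p(x_{t+1} \mid x_t, a_t)$, $r(x_t, a_t, x_{t+1})$ and the expectation $R(x_t, a_t)$ relate to one another.

```python
import random

# Hypothetical two-state example; all names and numbers are assumptions.
ACTIONS = {"low": ["hold", "buy"], "high": ["hold", "sell"]}  # A(x_t)

# TRANSITIONS[(x_t, a_t)] -> list of (x_next, p(x_next | x_t, a_t), gain)
TRANSITIONS = {
    ("low", "hold"): [("low", 0.9, 0.0), ("high", 0.1, 0.0)],
    ("low", "buy"): [("high", 0.6, 1.0), ("low", 0.4, -1.0)],
    ("high", "hold"): [("high", 0.8, 0.0), ("low", 0.2, 0.0)],
    ("high", "sell"): [("low", 0.7, 1.0), ("high", 0.3, -0.5)],
}


def expected_gain(state, action):
    """R(x_t, a_t): expectation of the gain over the subsequent state."""
    return sum(p * r for _, p, r in TRANSITIONS[(state, action)])


def step(state, action, rng):
    """Sample (x_next, r_t); the result depends only on (x_t, a_t)."""
    outcomes = TRANSITIONS[(state, action)]
    weights = [p for _, p, _ in outcomes]
    next_state, _, gain = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, gain
```

Because `step` receives only the current state and action, the sampled transition cannot depend on earlier history, which is the Markov property expressed in rule (2).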



Starting from observable variables, denoted
below as training data, the aim is to determine a
strategy, that is to say a sequence of functions

$$\pi = \{\mu_0, \mu_1, \ldots, \mu_T\}, \tag{3}$$

which at each instant t map each state into an
action, that is to say

$$\mu_t(x_t) = a_t. \tag{4}$$

Such a strategy is evaluated by an optimization
function.

The optimization function specifies the
expectation of the gains accumulated over time for a
given strategy $\pi$ and a start state $x_0$.

The so-called Q-learning method is described in 
[1] as an example of a method of approximative dynamic 
programming. 

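An optimum evaluation function $V^*(x)$ is defined
by

$$V^*(x) = \max_{\pi} V^{\pi}(x) \quad \forall x \in X \tag{5}$$

where

$$V^{\pi}(x) = E\left[\sum_{t=0}^{\infty} \gamma^t \, r(x_t, \mu_t, x_{t+1}) \,\middle|\, x_0 = x\right], \tag{6}$$

$\gamma$ denoting a prescribable reduction factor which is
formed in accordance with the following rule:

$$\gamma = \frac{1}{1 + z}, \tag{7}$$

$$z \in \mathbb{R}^+. \tag{8}$$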
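
As a brief illustration of rules (6) and (7), the following sketch computes the reduction factor from a value z and accumulates the discounted gains of one finite trajectory. The function names are hypothetical, and the truncation to a finite list of gains is an assumption made for the example; $V^{\pi}(x)$ itself is the expectation of such sums over all trajectories starting in x.

```python
def reduction_factor(z):
    """gamma = 1 / (1 + z) for a positive z, as in rule (7)."""
    if z <= 0:
        raise ValueError("z is assumed to be positive")
    return 1.0 / (1.0 + z)


def discounted_return(gains, gamma):
    """Sum of gamma**t * r_t over one observed (finite) trajectory."""
    total = 0.0
    for t, r_t in enumerate(gains):
        total += (gamma ** t) * r_t
    return total
```

For example, `reduction_factor(0.25)` gives 0.8, so a gain obtained one step later counts 80 % of an immediate gain.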



A Q-evaluation function $Q^*(x_t, a_t)$ is formed within
the Q-learning method for each pair (state $x_t$, action
$a_t$) in accordance with the following rule:

$$Q^*(x_t, a_t) = \sum_{x_{t+1} \in X} p(x_{t+1} \mid x_t, a_t) \, r_t + \gamma \sum_{x \in X} p(x \mid x_t, a_t) \max_{a \in A} Q^*(x, a). \tag{9}$$
On the basis respectively of the tuple ($x_t$,
$x_{t+1}$, $a_t$, $r_t$), the Q-values $Q^*(x, a)$ are adapted in
the (k+1)th iteration with a prescribed learning rate
$\eta_k$ in accordance with the following learning rule:

$$Q_{k+1}(x_t, a_t) = (1 - \eta_k) \, Q_k(x_t, a_t) + \eta_k \left( r_t + \gamma \max_{a \in A} Q_k(x_{t+1}, a) \right). \tag{10}$$
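
Learning rule (10) can be sketched for the tabular case as follows. The dictionary-based table and the function name `q_update` are assumptions chosen for the illustration; [1] itself uses function approximators rather than a table.

```python
from collections import defaultdict


def q_update(q, x_t, a_t, r_t, x_next, actions_next, eta, gamma):
    """Apply rule (10) once to the table entry q[(x_t, a_t)].

    q is a mapping (state, action) -> Q-value, actions_next the actions
    available in x_next, eta the learning rate and gamma the reduction factor.
    """
    best_next = max(q[(x_next, a)] for a in actions_next)
    q[(x_t, a_t)] = (1 - eta) * q[(x_t, a_t)] + eta * (r_t + gamma * best_next)
    return q[(x_t, a_t)]
```

A `defaultdict(float)` is a convenient table here because unseen pairs then start at a Q-value of zero.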

Usually, the so-called Q-values $Q^*(x, a)$ are
approximated for various actions a by a function
approximator in each case, for example a neural network
or else a polynomial classifier, with a weighting
vector $w^a$, which contains weights of the function
approximator.

A function approximator is to be understood as,
for example, a neural network, a polynomial classifier
or else a combination of a neural network with a
polynomial classifier.

It therefore holds that:

$$Q^*(x, a) \approx Q(x; w^a). \tag{11}$$

Changes in the weights in the weighting vector
$w^a$ are based on a temporal difference $d_t$ which is
formed in accordance with the following rule:

$$d_t := r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_k^{a}) - Q(x_t; w_k^{a_t}). \tag{12}$$

The following adaptation rule for the weights
of the neural network, which are included in the
weighting vector $w^a$, follows for the Q-learning method
with the use of a neural network:

$$w_{k+1}^{a_t} = w_k^{a_t} + \eta_k \cdot d_t \cdot \nabla Q(x_t; w_k^{a_t}). \tag{13}$$



The neural network representing the system of a 
financial market, as described in [1], is trained using 
the training data which describe information on changes 
in prices on a financial market as time series values. 

A further method of approximative dynamic 
programming, the so-called TD($\lambda$)-learning method, is
known from [2] and is explained in more detail in 
conjunction with an exemplary embodiment. 

Furthermore, it is known from [3] which risk is
associated with a strategy $\pi$ and an initial state $x_t$.
A method for risk avoidance is likewise known from [3].

The following optimization function, which is
also referred to as an expanded Q-function
$Q^{\pi}(x_t, a_t)$, is used in the method known from [3]:



$$Q^{\pi}(x_t, a_t) := r(x_t, a_t, x_{t+1}) + \inf_{\substack{x_0, x_1, \ldots \\ p(x_0, x_1, \ldots) > 0}} \sum_{k=1}^{\infty} \gamma^k \, r(x_k, \pi(x_k), x_{k+1}). \tag{14}$$

The expanded Q-function $Q^{\pi}(x_t, a_t)$ describes the
worst case if the action $a_t$ is executed in the state
$x_t$ and the strategy $\pi$ is followed thereupon.

The optimization function $Q^*(x_t, a_t)$ for

$$Q^*(x_t, a_t) := \max_{\pi \in \Pi} Q^{\pi}(x_t, a_t) \tag{15}$$

is given by the following rule:

$$Q^*(x_t, a_t) = \min_{\substack{x \in X \\ p(x \mid x_t, a_t) > 0}} \left( r(x_t, a_t, x) + \gamma \max_{a \in A} Q^*(x, a) \right). \tag{16}$$
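
The structure of rule (16) can be sketched as a single backup over a known transition model. The layout of `transitions` and `actions` is an assumption made for this example; [3] is not claimed to use this representation. The point of the sketch is that only the least favorable reachable subsequent state contributes, regardless of how probable it is.

```python
def worst_case_backup(q, transitions, actions, x_t, a_t, gamma):
    """Right-hand side of rule (16) for one pair (x_t, a_t).

    transitions[(x, a)] is a list of (x_next, probability, gain) and
    actions[x] the list of actions available in state x (both assumed).
    Subsequent states with probability zero are ignored.
    """
    return min(
        gain + gamma * max(q[(x_next, a)] for a in actions[x_next])
        for x_next, prob, gain in transitions[(x_t, a_t)]
        if prob > 0
    )
```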



A substantial disadvantage of this mode of
procedure is to be seen in that only the worst case is
taken into account when finding the strategy. However,
this reflects the requirements of the most varied
technical systems only to an inadequate extent.

Furthermore, it is known from [4] to formulate
access control for a communications network and the
routing within the communications network as a problem
of dynamic programming.

The invention is therefore based on the problem
of specifying a method and an arrangement for
determining a sequence of actions for a system, in
which method or arrangement an increased flexibility in
determining the strategy is achieved.

The problem is solved by the method and by the
arrangement in accordance with the features of the
independent patent claims.

In a method for computer-aided determination of
a sequence of actions for a system which has states, a
transition in state between two states being performed
on the basis of an action, the determination of the
sequence of actions is performed in such a way that a
sequence of states resulting from the sequence of
actions is optimized with regard to a prescribed
optimization function, the optimization function
including a variable parameter with the aid of which it
is possible to set a risk which the resulting sequence
of states has with respect to a prescribed state of the
system.

An arrangement for determining a sequence of
actions for a system which has states, a transition in
state between two states being performed on the basis
of an action, has a processor which is set up in such a
way that the determination of the sequence of actions
can be performed in such a way that a sequence of
states resulting from the sequence of actions is
optimized with regard to a prescribed optimization
function, the optimization function including a
variable parameter with the aid of which it is possible
to set a risk which the resulting sequence of states
has with respect to a prescribed state of the system.

It becomes possible for the first time owing to
the invention to specify a method for determining a
sequence of actions at a freely prescribable level of
accuracy when finding a strategy for a possible
closed-loop control or open-loop control of the system,
in general for influencing it.

Preferred developments of the invention follow
from the dependent claims.

The developments described below are valid both
for the method and for the arrangement, the processor
being respectively set up in the development of the
arrangement in such a way that the development can be
implemented.

In a preferred refinement, a method of
approximative dynamic programming is used for the
purpose of determination, for example a method based on
Q-learning or else a method based on TD($\lambda$)-learning.
Within Q-learning, the optimization function
OFQ is preferably formed in accordance with the
following rule:

$$OFQ = Q(x; w^a),$$

• x denoting a state in a state space X,

• a denoting an action from an action space A, and

• $w^a$ denoting the weights of a function approximator
which belong to the action a.

The following adaptation step is executed
during Q-learning in order to determine the optimum
weights $w^a$ of the function approximator:

$$w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t})$$

with the abbreviation

$$d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t}),$$

• $x_t$, $x_{t+1}$ respectively denoting a state in the state
space X,

• $a_t$ denoting an action from an action space A,

• $\gamma$ denoting a prescribable reduction factor,

• $w_t^{a_t}$ denoting the weighting vector associated with
the action $a_t$ before the adaptation step,

• $w_{t+1}^{a_t}$ denoting the weighting vector associated
with the action $a_t$ after the adaptation step,

• $\eta_t$ (t = 1, ...) denoting a prescribable step size
sequence,

• $\kappa \in [-1; 1]$ denoting a risk monitoring parameter,

• $\aleph^{\kappa}$ denoting a risk monitoring function
$\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$,

• $\nabla Q(\cdot;\cdot)$ denoting the derivative of the function
approximator with respect to its weights, and

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition
of state from the state $x_t$ to the subsequent state
$x_{t+1}$.
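
The adaptation step above can be sketched for the simplest function approximator, one that is linear in its weights, so that the derivative with respect to the weights is the feature vector itself. The linear form, the feature-list encoding of states and all function names are assumptions for this illustration; the text leaves the choice of approximator open (for example a neural network).

```python
def risk_transform(xi, kappa):
    """Risk monitoring function: (1 - kappa * sign(xi)) * xi."""
    sign = (xi > 0) - (xi < 0)
    return (1 - kappa * sign) * xi


def q_value(weights, features):
    """Linear approximator Q(x; w) = w . x (an assumption of this sketch)."""
    return sum(w * f for w, f in zip(weights, features))


def risk_sensitive_q_step(w, x_t, a_t, r_t, x_next, eta, gamma, kappa):
    """One adaptation step; w maps each action to its weight list.

    Only the weights of the executed action a_t are changed.
    Returns the temporal difference d_t.
    """
    best_next = max(q_value(w[a], x_next) for a in w)
    d_t = r_t + gamma * best_next - q_value(w[a_t], x_t)
    scale = eta * risk_transform(d_t, kappa)
    # For a linear approximator the gradient w.r.t. the weights is x_t.
    w[a_t] = [wi + scale * fi for wi, fi in zip(w[a_t], x_t)]
    return d_t
```

With a positive risk parameter, negative temporal differences are amplified and positive ones damped, so that unfavorable outcomes influence the weights more strongly than favorable ones.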

The optimization function is preferably formed
in accordance with the following rule within the
TD($\lambda$)-learning method:

$$OFTD = J(x; w),$$

• x denoting a state in a state space X,

• a denoting an action from an action space A, and

• w denoting the weights of a function approximator.

The following adaptation step is executed
during TD($\lambda$)-learning in order to determine the
optimum weights w of the function approximator:

$$w_{t+1} = w_t + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot z_t$$

with the abbreviations

$$d_t = r(x_t, a_t, x_{t+1}) + \gamma J(x_{t+1}; w_t) - J(x_t; w_t),$$

$$z_t = \lambda \cdot \gamma \cdot z_{t-1} + \nabla J(x_t; w_t),$$
• $x_t$, $x_{t+1}$ respectively denoting a state in the state
space X,

• $a_t$ denoting an action from an action space A,

• $\gamma$ denoting a prescribable reduction factor,

• $w_t$ denoting the weighting vector before the
adaptation step,

• $w_{t+1}$ denoting the weighting vector after the
adaptation step,

• $\eta_t$ (t = 1, ...) denoting a prescribable step size
sequence,

• $\kappa \in [-1; 1]$ denoting a risk monitoring parameter,

• $\aleph^{\kappa}$ denoting a risk monitoring function
$\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$,

• $\nabla J(\cdot;\cdot)$ denoting the derivative of the function
approximator with respect to its weights, and

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition
of state from the state $x_t$ to the subsequent state
$x_{t+1}$.
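
A minimal sketch of one TD($\lambda$) adaptation step with the risk monitoring function follows, again assuming an approximator J(x; w) that is linear in w so that its derivative is the feature vector x. That assumption, the list encoding and the function name `td_lambda_step` are hypothetical and chosen only for illustration.

```python
def td_lambda_step(w, z, x_t, r_t, x_next, eta, gamma, lam, kappa):
    """Return (w_new, z_new, d_t) for one risk-sensitive TD(lambda) step.

    w: weights, z: previous trace z_{t-1}, x_t / x_next: feature lists.
    J(x; w) = w . x is assumed, hence grad J(x_t; w) = x_t.
    """
    def j(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    d_t = r_t + gamma * j(x_next) - j(x_t)
    # z_t = lambda * gamma * z_{t-1} + grad J(x_t; w_t)
    z_new = [lam * gamma * zi + xi for zi, xi in zip(z, x_t)]
    sign = (d_t > 0) - (d_t < 0)
    transformed = (1 - kappa * sign) * d_t
    w_new = [wi + eta * transformed * zi for wi, zi in zip(w, z_new)]
    return w_new, z_new, d_t
```

The trace `z` is carried from call to call, so a weight that contributed to an earlier state is still adapted, with geometrically decaying strength, when a later temporal difference arrives.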

The system is preferably a technical system on
which, before the determination, measured values are
measured which are used in determining the sequence of
actions.

The technical system can be subjected to
open-loop control or else closed-loop control with the
use of the determined sequence of actions.

The system is preferably modeled as a Markov 
decision problem. 

The method or the arrangement is preferably
used in a traffic management system or in a
communications system, the sequence of actions being
used in a communications network to carry out access
control or a routing, that is to say a path allocation.
Furthermore, the system can be a financial
market which is modeled by a Markov decision problem,
the change in the financial market, for example the
change in an index of stocks or else a rate of exchange
on a foreign exchange market, being analyzed by using
the method and/or the arrangement, and it being
possible to intervene in the market in accordance with
the sequence of determined actions.

Exemplary embodiments of the invention are 
illustrated in the figures and explained in more detail 
below. 



Figure 1 shows a flowchart in which individual method 
steps of the first exemplary embodiment are 
illustrated; 

Figure 2 shows a sketch of a system which can be 
modeled as a Markov decision problem; 

Figure 3 shows a sketch of a communications network in 
which access control is carried out in a 
switching unit; 

Figure 4 shows a symbolic sketch of a function 
approximator with the aid of which a method 
of approximative dynamic programming is 
implemented; 

Figure 5 shows a further sketch of a plurality of 
function approximators with the aid of which 
approximative dynamic programming is 
implemented; and 

Figure 6 shows a sketch of a traffic management system 
which is subjected to closed-loop control in 
accordance with an exemplary embodiment. 
First exemplary embodiment: access control and routing

Figure 3 shows a communications network 300, which has
a multiplicity of switching units 301a, 301b, ...,
301i, ..., 301n, which are interconnected via
connections 302a, 302b, ..., 302j, ..., 302m.

Furthermore, a first terminal 303 is connected
to a first switching unit 301a. From the first terminal
303, the first switching unit 301a is sent a request
message 304 which requests reservation of a prescribed
bandwidth within the communications network 300 for the
purpose of transmitting data (video data, text data).

It is determined in the first switching unit 
301a in accordance with a strategy described below 
whether the requested bandwidth is available in the 
communications network 300 on a specified, requested 
connection (step 305).

The request is refused (step 306) if this is 
not the case. 

If sufficient bandwidth is available, it is 
checked in a further checking step (step 307) whether 
the bandwidth can be reserved. 

The request is refused (step 308) if this is 
not the case. 

Otherwise, the first switching unit 301a 
selects a route from the first switching unit 301a via 
further switching units 301i to a second terminal 309 
with which the first terminal 303 wishes to 
communicate, and a connection is initialized (step 310).
The starting point below is a communications
network 300 which comprises a set of switching units

$$N = \{1, \ldots, n, \ldots, N\} \tag{17}$$

and a set of physical connections

$$L = \{1, \ldots, l, \ldots, L\}, \tag{18}$$

a physical connection l having a capacity of B(l)
bandwidth units.

A set

$$M = \{1, \ldots, m, \ldots, M\} \tag{19}$$

of different types of service m is available, a type
of service m being characterized by

• a bandwidth requirement b(m),

• an average connection time $\frac{1}{\nu(m)}$, and

• a gain c(m) which is obtained whenever a call
request of the corresponding type of service m is
accepted.

The gain c(m) is given by the amount of money
which a network operator of the communications network
300 bills a subscriber for a connection of the type of
service. Clearly, the gain c(m) reflects different
priorities, which can be prescribed by the network
operator and which he associates with different
services.

A physical connection l can simultaneously
provide any desired combination of communications
connections as long as the bandwidth used for the
communications connections does not exceed the
bandwidth available overall for the physical
connection.
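
The capacity condition just stated can be sketched as a simple check. The dictionaries `used` and `capacity`, keyed by physical connection, and both function names are assumptions made for the illustration; the text does not prescribe a data layout.

```python
def route_has_capacity(route, used, capacity, demand):
    """True if every physical connection l on the route can carry b(m) more.

    route: list of physical connection ids,
    used[l]: bandwidth units currently in use on l,
    capacity[l]: B(l), demand: bandwidth requirement b(m).
    """
    return all(used[l] + demand <= capacity[l] for l in route)


def feasible_routes(routes, used, capacity, demand):
    """Filter the prescribed routes down to those with enough bandwidth."""
    return [r for r in routes if route_has_capacity(r, used, capacity, demand)]
```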
If a new communications connection of type m is
requested between a first node i and a second node j
(terminals are also denoted as nodes), the requested
communications connection can, as represented above,
either be accepted or be refused.

If the communications connection is accepted, a 
route is selected from a set of prescribed routes. This 
selection is denoted as a routing. b(m) bandwidth units 
are used in the communications connection of type m for 
each physical connection along the selected route for 
the duration of the connection. 

Thus, during access control (call admission 
control), a route can be selected within the 
communications network 300 only when the selected route 
has sufficient bandwidth available. 

The aim of the access control and of the 
routing is to maximize a long term gain which is 
obtained by acceptance of the requested connections. 

At an instant t, the technical system which is
the communications network 300 is in a state $x_t$ which
is described by a list of routes via existing
connections, which list shows how many connections of
which type of service are using the respective routes
at the instant t.

Events $\omega$, by means of which a state $x_t$ could be
transferred into a subsequent state $x_{t+1}$, are the
arrival of new connection request messages, or else the
termination of a connection existing in the
communications network 300.

In this exemplary embodiment, an action $a_t$ at
an instant t owing to a connection request is the
decision as to whether a connection request is to be
accepted or refused and, if the connection is accepted,
the selection of the route through the communications
network 300.

The aim is to determine a sequence of actions,
that is to say to learn a strategy with actions
relating to a state $x_t$ in such a way that the
following rule is maximized:

$$E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k} \, g(x_{t_k}, \omega_k, a_{t_k}) \right\}, \tag{20}$$

• E{.} denoting an expectation,

• $t_k$ denoting an instant at which a kth event takes
place,

• $g(x_{t_k}, \omega_k, a_{t_k})$ denoting the gain which is
associated with the kth event, and

• $\beta$ denoting a reduction factor which evaluates an
immediate gain as being more valuable than a gain
at instants lying further in the future.

Different implementations of a strategy
normally lead to different overall gains G:

$$G = \sum_{k=0}^{\infty} e^{-\beta t_k} \, g(x_{t_k}, \omega_k, a_{t_k}). \tag{21}$$
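
For a finite record of events, rule (21) reduces to a short sum, sketched below. Representing each event as a pair of its instant and its gain is an assumption of this example, as is the function name.

```python
import math


def overall_gain(events, beta):
    """Sum of exp(-beta * t_k) * g_k over recorded events (t_k, g_k)."""
    return sum(math.exp(-beta * t_k) * g_k for t_k, g_k in events)
```

With `beta = 0` every gain counts in full; a larger `beta` reduces the weight of gains at later instants.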

The aim is to maximize the expectation of the
overall gain G in accordance with the following rule J:

$$J = E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k} \, g(x_{t_k}, \omega_k, a_{t_k}) \right\}, \tag{22}$$

it being possible to set a risk which reduces the
overall gain G of a specific implementation of access
control and of a routing strategy to below the
expectation.

The TD($\lambda$)-learning method is used to carry out
the access control and the routing.

The following target function is used in this
exemplary embodiment:

$$J^*(x_t) = E_{\tau}\left\{ e^{-\beta \tau} \right\} E_{\omega}\left\{ \max_{a \in A} \left[ g(x_t, \omega_t, a) + J^*(x_{t+1}) \right] \right\}, \tag{23}$$

• A denoting an action space with a prescribed
number of actions which are respectively available
in a state $x_t$,

• $\tau$ denoting a first instant at which a first event
$\omega$ occurs, and

• $x_{t+1}$ denoting a subsequent state of the system.

An approximated value of the target value
$J^*(x_t)$ is learned and stored by employing a function
approximator 400 (compare Figure 4) with the use of
training data.

Training data are data previously measured in 
the communications network 300 and relating to the 
behavior of the communications network 300 in the case 
of incoming connection requests 304 and of termination 
of messages. This time sequence of states is stored, 
and these training data are used to train the function 
approximator 400 in accordance with the learning method 
described below. 
Each input 401, 402, 403 of the function
approximator 400 receives as input variable the number
of connections of one type of service m on one route of
the communications network 300. These are represented
symbolically in Figure 4 by blocks 404, 405, 406.

An approximated target value $\tilde{J}$ of the target
value $J^*$ is the output variable of the function
approximator 400.

Figure 5 shows a detailed representation of the
function approximator 500, which in this case has
several component function approximators 510, 520. One
output variable is the approximated target value
$\tilde{J}$, which is formed in accordance with the
following rule:

$$\tilde{J}(x_t, \Theta_t) = \sum_{r} \tilde{J}^{(r)}\left(x_t^{(r)}, \Theta_t^{(r)}\right). \tag{24}$$

The input variables of the component function
approximators 510, 520, which are present at the inputs
511, 512, 513 of the first component function
approximator 510, or at the inputs 521, 522 and 523 of
the second component function approximator 520 are, in
turn, respectively a number of connections of a type of
service m on a physical connection r in each case,
symbolized by blocks 514, 515, 516 for the first
component function approximator 510, and 524, 525 and
526 for the second component function approximator 520.

Component output variables 530, 531, 532, 533
are fed to an adder unit 540, and the approximated
target variable $\tilde{J}$ is formed as output variable
of the adder unit.

Let it be assumed that the communications
network 300 is in the state $x_{t_k}$ and that a request
message with which a type of service of class m is
requested for a connection between two nodes i, j
reaches the first switching unit 301a.

A list of permitted routes between the nodes i
and j is denoted by R(i, j), and a list of all possible
routes is denoted by

$$\tilde{R}(i, j, x_{t_k}) \subset R(i, j) \tag{25}$$

as a subset of the routes R(i, j) which could implement
a possible connection with regard to the available and
requested bandwidth.

For each possible route r, $r \in \tilde{R}(i, j, x_{t_k})$, a
subsequent state $x_{t_{k+1}}(x_{t_k}, \omega_k, r)$ is determined
which results from the fact that the connection request
304 is accepted and the connection on the route r is
made available to the requesting first terminal 303.

This is illustrated in Figure 1 as second step
(step 102), the state of the system and the respective
event being respectively determined in a first step
(step 101).

A route r* to be selected is determined in a
third step (step 103) in accordance with the following
rule:

$$r^* = \arg\max_{r \in \tilde{R}(i, j, x_{t_k})} \tilde{J}\left(x_{t_{k+1}}(x_{t_k}, \omega_k, r), \Theta_t\right). \tag{26}$$

A check is made in a further step (step 104) as
to whether the following rule is fulfilled:

$$c(m) + \tilde{J}\left(x_{t_{k+1}}(x_{t_k}, \omega_k, r^*), \Theta_t\right) < \tilde{J}(x_{t_k}, \Theta_t). \tag{27}$$
If this is the case, the connection request 304
is rejected (step 105), otherwise the connection is
accepted and "switched through" to the node j along the
selected route r* (step 106).
Weights of the function approximator 400, 500
which are adapted in the TD($\lambda$)-learning method to the
training data, are stored in a parameter vector $\Theta$
for an instant t in each case, such that an optimized
access control and an optimized routing are achieved.

During the training phase, the weighting
parameters are adapted to the training data applied to
the function approximator.

A risk parameter $\kappa$ is defined with the aid of
which a desired risk, which the system has with regard
to a prescribed state owing to a sequence of actions
and states, can be set in accordance with the following
rules:

$-1 \le \kappa < 0$:  risky learning,
$\kappa = 0$:         neutral learning with regard to the risk,
$0 < \kappa < 1$:     risk-avoiding learning,
$\kappa = 1$:         worst-case learning.

Furthermore, a prescribable parameter
$0 \le \lambda \le 1$ and a step size sequence $\gamma_k$ are
prescribed in the learning method.

The weighting values of the weighting vector
$\Theta$ are adapted to the training data on the basis of
each event $\omega_{t_k}$ in accordance with the following
adaptation rule:

$$\Theta_k = \Theta_{k-1} + \gamma_k \, \aleph^{\kappa}(d_k) \, z_t, \tag{28}$$
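in which case

$$d_k = e^{-\beta (t_k - t_{k-1})} \left( g(x_{t_k}, \omega_k, a_{t_k}) + \tilde{J}(x_{t_k}, \Theta_{k-1}) \right) - \tilde{J}(x_{t_{k-1}}, \Theta_{k-1}), \tag{29}$$

$$z_t = \lambda \, e^{-\beta (t_{k-1} - t_{k-2})} \, z_{t-1} + \nabla_{\Theta} \tilde{J}(x_{t_{k-1}}, \Theta_{k-1}), \tag{30}$$

and

$$\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi. \tag{31}$$

It is assumed that: $z_{-1} = 0$.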
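
Rules (29) and (30) differ from a fixed-step formulation in that the discount depends on the time elapsed between events. The following sketch shows just these two computations; passing the approximated target values and the gradient in as plain numbers and lists is an assumption made to keep the example independent of any particular function approximator, and the function names are hypothetical.

```python
import math


def temporal_difference(g_k, j_now, j_prev, beta, t_k, t_prev):
    """d_k as in rule (29): discount by the time between two events."""
    return math.exp(-beta * (t_k - t_prev)) * (g_k + j_now) - j_prev


def update_trace(z_prev, grad_prev, lam, beta, t_prev, t_prev2):
    """z_t as in rule (30); grad_prev is the gradient at the previous state."""
    decay = lam * math.exp(-beta * (t_prev - t_prev2))
    return [decay * zi + gi for zi, gi in zip(z_prev, grad_prev)]
```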
The function

$$g(x_{t_k}, \omega_k, a_{t_k}) \tag{32}$$

denotes the immediate gain in accordance with the
following rule:

$$g(x_{t_k}, \omega_k, a_{t_k}) = \begin{cases} c(m) & \text{when } \omega_{t_k} \text{ is a service request for a type of service } m \text{ and the connection is accepted,} \\ 0 & \text{otherwise.} \end{cases} \tag{33}$$



Thus, as described above, a sequence of actions
is determined with regard to a connection request such
that a connection request is either rejected or
accepted on the basis of an action. The determination
is performed taking account of an optimization function
in which the risk can be set in a variable fashion by
means of a risk control parameter $\kappa \in [-1; 1]$.
Second exemplary embodiment: traffic management system

Figure 6 shows a road 600 on which automobiles 
601, 602, 603, 604, 605 and 606 are being driven. 

Conductor loops 610, 611 integrated into the 
road 600 receive electric signals in a known way and 
feed the electric signals 615, 616 to a computer 620 
via an input/output interface 621. In an analog-to- 
digital converter 622 connected to the input/output 
interface 621, the electric signals are digitized into 
a time series and stored in a memory 623, which is 
connected by a bus 624 to the analog-to-digital 
converter 622 and a processor 625. Via the input/output
interface 621, a traffic management system 650 is fed 
control signals 651 from which it is possible to set a 
prescribed speed stipulation 652 in the traffic 
management system 650, or else further particulars of 
traffic regulations, which are displayed via the 
traffic management system 650 to drivers of the 
vehicles 601, 602, 603, 604, 605 and 606. 

The following local state variables are used in
this case for the purpose of traffic modeling:

• traffic flow rate v,

• vehicle density $\rho$ ($\rho$ = number of vehicles per
kilometer, veh/km),

• traffic flow q (q = number of vehicles per hour,
veh/h; $q = v \cdot \rho$), and

• speed restrictions 652 displayed by the traffic
management system 650 at an instant in each case.

The local state variables are measured as 
described above by using the conductor loops 610, 611. 
These variables (v(t), $\rho$(t), q(t)) therefore
represent a state of the technical system of "traffic"
at a specific instant t.

In this exemplary embodiment, the system is 
therefore a traffic system which is controlled by using 
the traffic management system 650. 

In this second exemplary embodiment, an 
extended Q-learning method is described as method of 
approximative dynamic programming. 

The state $x_t$ is described by a state vector

$$x(t) = (v(t), \rho(t), q(t)). \tag{34}$$

The action $a_t$ denotes the speed restriction
652, which is displayed at the instant t by the traffic
management system 650.

The gain $r(x_t, a_t, x_{t+1})$ describes the quality of
the traffic flow which was measured between the
instants t and t+1 by the conductor loops 610 and 611.
In this second exemplary embodiment, $r(x_t, a_t, x_{t+1})$
denotes

• the average speed of the vehicles in the time
interval [t, t+1]

or

• the number of vehicles which have passed the
conductor loops 610 and 611 in the time interval
[t, t+1]

or

• the variance of the vehicle speeds in the time
interval [t, t+1],

or

• a weighted sum of the above variables.
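
The last alternative, a weighted sum of the three measured quantities, can be sketched as follows. The weights, their ordering, the use of the population variance and the return value of zero for an interval without vehicles are all assumptions of this example; the text prescribes none of them.

```python
from statistics import mean, pvariance


def traffic_gain(speeds, weights):
    """Weighted sum of mean speed, vehicle count and speed variance.

    speeds: vehicle speeds measured in the interval [t, t+1],
    weights: (w_mean, w_count, w_var), hypothetical weighting factors.
    """
    w_mean, w_count, w_var = weights
    if not speeds:
        return 0.0  # assumed convention when no vehicle was detected
    return (w_mean * mean(speeds)
            + w_count * len(speeds)
            + w_var * pvariance(speeds))
```

A negative `w_var` would penalize strongly varying speeds, which is one plausible way of expressing a preference for homogeneous traffic flow.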



A value of the optimization function OFQ is
determined for each possible action $a_t$, that is to say
for each speed restriction which can be displayed by
the traffic management system 650, an estimated value
of the optimization function OFQ being realized in each
case as a neural network.

This results in a set of evaluation variables
for the various actions $a_t$ in the system state $x_t$.

Those actions $a_t$ for which the maximum
evaluation variable OFQ has been determined in the
current system state $x_t$ are selected in a control
phase from the possible actions $a_t$, that is to say
from the set of the speed restrictions which can be
displayed by the traffic management system 650.

In accordance with this exemplary embodiment,
the adaptation rule, known from the Q-learning method,
for calculating the optimization function OFQ is
extended by a risk control function $\aleph^{\kappa}(\cdot)$,
which takes account of the risk.

In turn, the risk control parameter $\kappa$ is
prescribed in accordance with the strategy from the
first exemplary embodiment in the interval
$-1 \le \kappa \le 1$, and represents the risk which a user
wishes to run in the application with regard to the
control strategy to be determined.

The following evaluation function OFQ is used
in accordance with this exemplary embodiment:

$$OFQ = Q(x; w^a), \tag{35}$$

• x = (v; $\rho$; q) denoting a state of the traffic
system,

• a denoting a speed restriction from the action
space A of all speed restrictions which can be
displayed by the traffic management system 650,
and

• $w^a$ denoting the weights of the neural network
which belong to the speed restriction a.

The following adaptation step is executed in
Q-learning in order to determine the optimum weights
$w^a$ of the neural network:

$$w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t}) \tag{36}$$

using the abbreviation

$$d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t}), \tag{37}$$

• $x_t$, $x_{t+1}$ denoting in each case a state of the
traffic system in accordance with rule (34),

• $a_t$ denoting an action, that is to say a speed
restriction which can be displayed by the traffic
management system 650,

• $\gamma$ denoting a prescribable reduction factor,

• $w_t^{a_t}$ denoting the weighting vector belonging to
the action $a_t$, before the adaptation step,

• $w_{t+1}^{a_t}$ denoting the weighting vector belonging to
the action $a_t$, after the adaptation step,

• $\eta_t$ (t = 1, ...) denoting a prescribable step size
sequence,

• $\kappa \in [-1; 1]$ denoting a risk control parameter,

• $\aleph^{\kappa}$ denoting a risk control function
$\aleph^{\kappa}(\xi) = (1 - \kappa \, \mathrm{sign}(\xi)) \, \xi$,

• $\nabla Q(\cdot;\cdot)$ denoting the derivative of the neural
network with respect to its weights, and

• $r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition
in state from the state $x_t$ to the subsequent state
$x_{t+1}$.

An action $a_t$ can be selected at random from the
possible actions $a_t$ during learning. It is not
necessary in this case to select the action $a_t$ which
has led to the largest evaluation variable.

The adaptation of the weights has to be
performed in such a way that not only is a traffic
control achieved which is optimized in terms of the
expectation of the optimization function, but also
account is taken of a variance of the control results.

This is particularly advantageous since the
state vector x(t) models the actual system of traffic
only inadequately in some aspects, and so unexpected
disturbances can thereby occur. Thus, the dynamics of
the traffic, and therefore of its modeling, depend on
further factors such as weather, proportion of trucks
on the road, proportion of mobile homes, etc., which
are not always integrated in the measured variables of
the state vector x(t). In addition, it is not always
ensured that the road users immediately implement the
new speed instructions in accordance with the traffic
management system.
A control phase on the real system in
accordance with the traffic management system takes
place in accordance with the following steps:
1. The state $x_t$ is measured at the instant t at
various points in the traffic system and yields a
state vector x(t) := (v(t), $\rho$(t), q(t)).
2. A value of the optimization function is determined
for all possible actions $a_t$, and that action $a_t$
with the highest evaluation in the optimization
function is selected.
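
The two steps of the control phase can be sketched as a selection over one evaluation function per speed restriction. Representing each trained neural network by a plain callable, and the name `select_speed_limit`, are assumptions of this illustration.

```python
def select_speed_limit(state, approximators):
    """Return the speed restriction with the highest evaluation.

    state: measured state vector (v, rho, q),
    approximators: mapping speed restriction -> callable that returns the
    estimated value of the optimization function for the given state.
    """
    return max(approximators, key=lambda a: approximators[a](state))
```

In contrast to the learning phase, where an action may be chosen at random, this selection is purely greedy with respect to the learned evaluations.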